Abstract: This paper proposes a decoder architecture for low-density
parity-check convolutional codes (LDPCCCs). Specifically, the LDPCCC
is derived from a quasi-cyclic (QC) LDPC block code. By making use of
the quasi-cyclic structure, the proposed LDPCCC decoder adopts dynamic
message storage in the memory and uses a simple address controller.
The decoder efficiently combines the memories of the pipelined
processors into a large memory block so as to take advantage of the
data width of the embedded memory in a modern field-programmable gate
array (FPGA). A rate-5/6 QC-LDPCCC has been implemented on an Altera
Stratix FPGA. It achieves a throughput of up to 2.0 Gb/s at a clock
frequency of 100 MHz. Moreover, the decoder achieves an error rate
lower than $10^{-13}$ at a bit-energy-to-noise-power-spectral-density
ratio ($E_b/N_0$) of 3.55 dB.

Index Terms: Decoder architecture, FPGA implementation,
LDPC convolutional code, QC-LDPC convolutional code



I. Introduction 

Low-density parity-check (LDPC) codes, first invented by
Gallager in the 1960s [1], have been found to be capable of
approaching the channel capacity. Later, LDPC convolutional
codes (LDPCCCs) were shown to outperform LDPC
block codes in terms of error performance (e.g., lower error
floors and higher coding gains) under a similar decoding
complexity [2]. Comparisons between LDPCCCs and LDPC
block codes from the perspectives of hardware complexity,
delay requirements and memory requirements have been discussed
in [3] and [4].

The LDPCCC inherits the basic structure of a convolutional
code and enables continuous encoding and decoding of
messages of varying lengths. This property makes the LDPCCC
a promising solution in many applications. When designing
an LDPCCC for an application, moreover, many factors
such as code rate, sub-block length, coding gain, throughput,
error performance and encoder/decoder complexity may
have to be taken into consideration. High-data-rate optical
communications require powerful error-correction codes with
low redundancy to achieve an error floor lower than a bit
error rate (BER) of $10^{-13}$, preferably $10^{-15}$ [5], [6]. Motivated
by such applications, the goal of this work is to design and
implement an efficient decoder architecture such that codes can
achieve high throughput, high coding gain, high code rate and
a low error floor.

The design of high-throughput decoder architectures for LDPC
block codes has been extensively studied. In [7], a
high-throughput memory-efficient decoder architecture that jointly
optimizes the code design, the decoding algorithm and the
architecture level has been proposed. A practical coding
system design approach has been presented in [8] whereby
the LDPC codes are constructed subject to decoder hardware
constraints. Simulation results have shown that the codes
constructed suffer from only a minor performance loss compared
with unconstrained ones. In [9], a quasi-cyclic LDPC
(QC-LDPC) decoder architecture that achieves a throughput of 172
Mbps has been studied. The high throughput is achieved by
reducing the critical path through modifying the decoding
algorithm as well as the check-node and variable-node processor
architectures. In [10], the throughput of a QC-LDPC decoder
is further improved by parallelizing the processing of all layers
in layered decoding. As a result, the decoder can achieve a
maximum throughput of 2.2 Gbps with an operating frequency
of 950 MHz and 10 min-sum decoding iterations. In [11],
the authors have proposed a high-speed flexible shift-LDPC
decoder that can adapt to different code lengths and code
rates. The decoder employs the Benes network to handle the
complicated interconnections for various code parameters. It
adopts single-minimum min-sum decoding and achieves a
throughput of 3.6 Gbps with an operating frequency of 290
MHz.

Although LDPCCC decoders may "borrow" some design
techniques used in the LDPC block decoder architectures,
overall they are very different from their block-code counterparts
due to the distinct code construction mechanism and unique
characteristics of LDPCCCs. High-throughput LDPCCC
decoder architectures based on parallelization have been studied
in [12], [13]. Such architectures can achieve a throughput
of over 1 Gbps with a clock frequency of 250 MHz. They,
however, are confined to time-invariant LDPCCCs and cannot
be easily applied to time-varying ones, which usually produce
a better error performance. In [14], a register-based decoder
architecture attaining up to 175 Mbps throughput has been
proposed. This architecture has successfully implemented a
pipeline decoder with 10 processing units. Nonetheless, its
register-intensive architecture has limited its power efficiency.
In [15], [16], a low-cost low-power memory-based decoder
architecture that uses a single decoding processor has been
proposed. On the one hand, the serial node operation uses a
small portion of the field-programmable gate array (FPGA)
resources. On the other hand, such a design poses a
significant limitation on the achievable throughput. Subsequently,
memory-based designs with parallel node operations have
been proposed and have led to a substantial improvement
in throughput [17]-[19]. The high throughput accomplished
under these designs, however, is achieved at the cost of a
complicated switch network.

To the best of the authors' knowledge, the previously
proposed LDPCCC decoder architectures mainly handle
random time-varying LDPCCCs. In this paper, we propose a
decoder architecture for LDPCCCs with regular structures. In
particular, the proposed decoder caters for a class of LDPCCCs
that have a quasi-cyclic structure and can be derived from
a QC-LDPC block code [20]. The motivation for considering
codes with regular structures is twofold. First, LDPCCCs with
regular structures have recently attracted much interest both
theoretically and empirically [21], [22]. Second, following the
insights from LDPC block codes, regular codes can make the
decoder structure much simpler and at the same time achieve
good error performance. Therefore, developing an efficient
architecture for regular codes is of high importance in practice.

The contributions of our paper are distinct from previous
works in many aspects including complexity, throughput,
reliability and scalability. First, we eliminate all switch networks,
which are included in most of the previous implementations
and are very complex for a high-rate LDPCCC. Instead, we
propose the use of dedicated block processing units, with
which we can provide higher throughput with similar decoder
complexity. Second, the quantized sum-product algorithm
(QSPA) applied in our LDPCCC decoder is more reliable
than the min-sum-based LDPCCC decoder, i.e.,
QSPA outperforms the min-sum-based decoder in terms of
error performance. Furthermore, our proposed QSPA
implementation has a complexity only linearly proportional to the
check-node degree. Third, it is known that more decoding
iterations can enhance the error performance of the decoder. In
our decoder design, each decoding iteration is accomplished
by one processor and the processors are serially connected.
Our decoder architecture also enables us to change the number
of processors easily without re-designing the whole decoder.
Thus, our decoder is scalable in terms of the number of
processors. We have implemented our decoder architecture for
a rate-5/6 LDPCCC on an Altera Stratix FPGA. The decoder
achieves a throughput of 2.0 Gbps with a clock running
at 100 MHz. Moreover, the LDPCCC has an excellent error
performance, achieving an error rate lower than $10^{-13}$ at a
bit-energy-to-noise-power-spectral-density ratio ($E_b/N_0$) of 3.55
dB.

The rest of the paper is organized as follows. Section II
reviews the construction of QC-LDPCCCs and the decoding
process for such codes. Section III describes the proposed
decoder architecture and pipeline schedule. Section IV presents
the implementation complexity of the decoder architecture.
The FPGA simulation results are also presented in that section.
Finally, Section V concludes the paper.

II. Review of LDPC Convolutional Codes 

A. Structures of LDPCCC and QC-LDPCCC 

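The parity-check matrix of an unterminated time-varying
periodic LDPCCC has the banded form

$$
H_{[0,\infty]} = \begin{bmatrix}
H_0(0) & & & & \\
H_1(1) & H_0(1) & & & \\
\vdots & \vdots & \ddots & & \\
H_{m_s}(m_s) & H_{m_s-1}(m_s) & \cdots & H_0(m_s) & \\
 & H_{m_s}(m_s+1) & \cdots & H_1(m_s+1) & H_0(m_s+1) \\
 & & \ddots & & \ddots
\end{bmatrix} \tag{1}
$$

where $m_s$ is termed the memory of the parity-check matrix, and
$H_i(t)$, $i = 0, 1, \ldots, m_s$, are $(c-b) \times c$ sub-matrices with
full rank. An LDPCCC is periodic with period $T$ if $H_i(t) = H_i(t+T)$
for all $i = 0, 1, \ldots, m_s$. If $T = 1$, the code is time-invariant;
otherwise, it is time-varying. The code rate of the LDPCCC
is given by $R = b/c$. Moreover, a coded sequence $v_{[\infty]} =
[v_0, v_1, \ldots]$ with $v_t = [v_{t,1}, v_{t,2}, \ldots, v_{t,c}]$ ($t = 0, 1, 2, \ldots$)
satisfies $v_{[\infty]} H_{[0,\infty]}^{T} = 0$.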
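Given a quasi-cyclic LDPC (QC-LDPC) block code with a
base matrix of size $n_c \times n_v$ and an expansion factor of $z$ [23],
we can construct a QC-LDPCCC¹ as follows.

1) Expand the parity-check matrix of the QC-LDPC block
code into a $zn_c \times zn_v$ matrix $H^b$.

2) Represent the $zn_c \times zn_v$ parity-check matrix $H^b$ as an
$M \times M$ block matrix, where $M$ is the greatest common divisor
of $n_c$ and $n_v$, i.e., $M = \gcd(n_c, n_v)$. Then we have

$$
H^b = \begin{bmatrix}
H^b_{1,1} & \cdots & H^b_{1,M} \\
\vdots & \ddots & \vdots \\
H^b_{M,1} & \cdots & H^b_{M,M}
\end{bmatrix} \tag{2}
$$

where $H^b_{i,j}$ is a $(zn_c/M) \times (zn_v/M)$ matrix, for $i, j = 1, 2, \ldots, M$.

3) Split $H^b$ into $H^b_l$ and $H^b_u$, which correspond to the lower
triangular part and the strictly upper triangular part of
$H^b$, respectively. $H^b_l$ and $H^b_u$ are therefore denoted,
respectively, by

$$
H^b_l = \begin{bmatrix}
H^b_{1,1} & & & \\
H^b_{2,1} & H^b_{2,2} & & \\
\vdots & \vdots & \ddots & \\
H^b_{M,1} & H^b_{M,2} & \cdots & H^b_{M,M}
\end{bmatrix}
\quad \text{and} \quad
H^b_u = \begin{bmatrix}
 & H^b_{1,2} & \cdots & H^b_{1,M} \\
 & & \ddots & \vdots \\
 & & & H^b_{M-1,M} \\
 & & &
\end{bmatrix}.
$$

4) Unwrap the parity-check matrix of the block code to
obtain the parity-check matrix of a QC-LDPCCC in the
form of (1), i.e.,

$$
H_{[0,\infty]} = \begin{bmatrix}
H^b_l & & & \\
H^b_u & H^b_l & & \\
 & H^b_u & H^b_l & \\
 & & \ddots & \ddots
\end{bmatrix} \tag{3}
$$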
¹ We define a QC-LDPCCC as an LDPCCC in which all the elements $H_i(t)$
in the parity-check matrix $H$ are composed of identity matrices,
cyclic-right-shifted identity matrices or zero matrices.
The above construction process is illustrated in Fig. 1. By
comparing (1) and (3), it can be observed that the period of
the QC-LDPCCC is $T = M$ and the memory $m_s$ satisfies
$M = m_s + 1$. It can also be observed that the relative positions
between the variable nodes and the check nodes do not change.
Hence the girth of the QC-LDPCCC is no less than that of
the original QC-LDPC block code [24]. Therefore, we can
construct a large-girth QC-LDPCCC by first designing the
sub-matrices to obtain a large-girth QC-LDPC block code and then
performing the unwrapping operation.
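The four construction steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base matrix of shift values and the helper names (`circulant`, `expand`, `unwrap`) are hypothetical and not taken from the paper. The sketch relies on the observation that, after unwrapping, block row $t$ holds block $(t \bmod M, s \bmod M)$ of the $M \times M$ partition in block column $s$, for $s = t - M + 1, \ldots, t$.

```python
from math import gcd

def circulant(z, shift):
    """z x z identity matrix cyclically right-shifted by `shift` (None gives a zero matrix)."""
    if shift is None:
        return [[0] * z for _ in range(z)]
    return [[1 if c == (r + shift) % z else 0 for c in range(z)] for r in range(z)]

def expand(base, z):
    """Step 1: expand an n_c x n_v base matrix of shift values into the binary matrix H^b."""
    H = []
    for base_row in base:
        blocks = [circulant(z, s) for s in base_row]
        for r in range(z):
            H.append([bit for blk in blocks for bit in blk[r]])
    return H

def unwrap(base, z, num_block_rows):
    """Steps 2-4: build the first `num_block_rows` block rows of the unwrapped matrix."""
    n_c, n_v = len(base), len(base[0])
    M = gcd(n_c, n_v)
    Hb = expand(base, z)
    rows_per_block = z * n_c // M   # c - b
    cols_per_block = z * n_v // M   # c

    def block(i, j):
        """Block (i, j) of the M x M partition of H^b (0-based indices)."""
        return [row[j * cols_per_block:(j + 1) * cols_per_block]
                for row in Hb[i * rows_per_block:(i + 1) * rows_per_block]]

    H = [[0] * (num_block_rows * cols_per_block)
         for _ in range(num_block_rows * rows_per_block)]
    for t in range(num_block_rows):      # block row (time index)
        for k in range(M):               # delay 0..m_s, with m_s = M - 1
            s = t - k                    # block column
            if s < 0:
                continue
            # s % M <= t % M: block from the lower-triangular part;
            # s % M >  t % M: block from the strictly upper-triangular part.
            blk = block(t % M, s % M)
            for r in range(rows_per_block):
                row = H[t * rows_per_block + r]
                row[s * cols_per_block:(s + 1) * cols_per_block] = blk[r]
    return H, M, rows_per_block, cols_per_block

# Hypothetical 2 x 4 base matrix of circulant shifts with z = 4.
H, M, rb, cb = unwrap([[0, 1, 2, 3], [1, 3, 0, 2]], z=4, num_block_rows=6)
print(M, rb, cb, len(H), len(H[0]))
```

With these hypothetical shifts, every check node after the first block row keeps degree $n_v = 4$ and every fully covered variable node keeps degree $n_c = 2$, which reflects the statement that the relative positions of check and variable nodes are unchanged by unwrapping.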

B. Decoding Algorithm for LDPCCC 

The LDPCCC has an inherent pipeline decoding process [2].
The pipeline decoder consists of $I$ processors, separated by
$c(m_s + 1)$ code symbols, with $I$ being the maximum number
of decoding iterations. Throughout the decoding process, we
assume that messages in log-likelihood-ratio (LLR) form are
being used.

At the start of each decoding step (say at time $t_0$), the
incoming channel messages associated with the $c$ new variable
nodes $v_{t_0} = [v_{t_0,1}, v_{t_0,2}, \ldots, v_{t_0,c}]$ enter the first processor.
Moreover, the corresponding variable-to-check messages for
these variable nodes have the same values as the incoming
channel messages. At the same time, the messages associated
with the variable nodes $v_{t_0-i(m_s+1)}$ are shifted from the $i$-th
processor to the $(i+1)$-th processor, where $i = 1, 2, \ldots, I-1$.
Then, the $i$-th processor updates the $(c-b)$ check nodes
corresponding to the $(t_0 - (i-1)(m_s+1))$-th block row of $H_{[0,\infty]}$
in (1) using

$$
\alpha_{mn} = 2\tanh^{-1}\left( \prod_{n' \in \mathcal{N}(m) \setminus n} \tanh\frac{\beta_{mn'}}{2} \right) \tag{4}
$$

where $\alpha_{mn}$ is the check-to-variable message from check node
$m$ to variable node $n$; $\beta_{mn}$ is the variable-to-check message
from variable node $n$ to check node $m$; $\mathcal{N}(m)$ is the set of
variable nodes connected to check node $m$; and $\mathcal{N}(m) \setminus n$ is
the set $\mathcal{N}(m)$ excluding variable node $n$. Next, the processors
perform variable-node updating for $v_{t_0-(i-1)(m_s+1)-m_s}$, $i =
1, 2, \ldots, I$, using

$$
\beta_{mn} = \lambda_n + \sum_{m' \in \mathcal{M}(n) \setminus m} \alpha_{m'n} \tag{5}
$$

where $\lambda_n$ is the channel message for variable node $n$; $\mathcal{M}(n)$
is the set of check nodes connected to variable node $n$; and
$\mathcal{M}(n) \setminus m$ is the set $\mathcal{M}(n)$ excluding check node $m$. Finally,
the a posteriori probabilities (APPs) for the $c$ variable nodes
$v_{t_0-(I-1)(m_s+1)-m_s}$ leaving the last processor are computed
using

$$
\beta_n = \lambda_n + \sum_{m' \in \mathcal{M}(n)} \alpha_{m'n}, \tag{6}
$$

based on which the binary value of each individual variable
node is determined.

Thus, each decoding step consists of inputting new channel
messages to the decoder, shifting messages, updating
check-to-variable messages, updating variable-to-check messages,
computing APPs and decoding the output bits. As a result,
after an initial delay of $(m_s + 1)I$ decoding steps, there is a
continuous output of the decoded bits.
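The three node computations used in each decoding step, i.e., the check-node update, the variable-node update and the APP computation, can be illustrated with a floating-point Python sketch. The function names are hypothetical, the clipping constant is an implementation convenience rather than part of the algorithm, and the hard-decision rule assumes the common convention that a non-negative LLR maps to bit 0.

```python
import math

def check_node_update(beta):
    """Check-to-variable messages: for each edge, combine all *other* incoming LLRs."""
    alpha = []
    for n in range(len(beta)):
        prod = 1.0
        for k, b in enumerate(beta):
            if k != n:
                prod *= math.tanh(b / 2.0)
        # Clip so that atanh stays finite for very reliable inputs (assumption).
        prod = max(min(prod, 1.0 - 1e-12), -1.0 + 1e-12)
        alpha.append(2.0 * math.atanh(prod))
    return alpha

def variable_node_update(lam, alpha):
    """Variable-to-check messages: channel LLR plus all *other* check-to-variable LLRs."""
    total = lam + sum(alpha)
    return [total - a for a in alpha]

def app(lam, alpha):
    """A posteriori LLR: channel LLR plus all check-to-variable LLRs."""
    return lam + sum(alpha)

def hard_decision(llr):
    """Assumed convention: LLR >= 0 decodes to bit 0, otherwise bit 1."""
    return 0 if llr >= 0 else 1

print(check_node_update([1.0, 2.0]))
print(variable_node_update(0.5, [1.0, -2.0, 3.0]))
print(hard_decision(app(0.5, [1.0, -2.0, 3.0])))
```

For a degree-2 check node the extrinsic message on one edge is simply the LLR arriving on the other edge, which gives a quick sanity check of the check-node routine.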

III. Decoder architecture 

In the hardware design of an LDPCCC decoder, the
processor complexity, memory requirement, throughput and error
performance are closely related. It is worthwhile to study
their tradeoffs so as to design a decoder meeting the
application requirements. Following the notation presented in the
construction of a QC-LDPCCC, we can roughly characterize
the factors affecting the decoder as follows. Suppose the
decoding process is divided into $G$ stages. A smaller $G$
provides a higher level of parallelism that the decoder can
achieve. The error performance of an LDPCCC improves as $z$
increases and/or $I$ increases and/or $R$ decreases. Furthermore,
the information throughput is proportional to $zR/G$ while
the memory usage is proportional to $zIn_v^2(1-R)$. Also,
the processor complexity in terms of combinational logic
is proportional to $zIn_v^2(1-R)/G$. More details about the
complexity of memory usage are given in Section III-B.
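The proportionalities for memory usage and processor complexity can be traced back to the block dimensions. The following is a short derivation sketch, assuming the general-case RAM count of Section III-B, $M$ BPUs per processor with $(c-b)/G$ CNPs each, and a CNP cost on the order of its degree $n_v$:

```latex
\begin{align*}
R &= \frac{b}{c} = \frac{z n_v/M - z n_c/M}{z n_v/M} = 1 - \frac{n_c}{n_v}
  \quad\Longrightarrow\quad n_c = n_v (1 - R), \\
\text{edge-message entries} &= \underbrace{I}_{\text{processors}} \cdot
  \underbrace{\frac{z n_c n_v}{G}}_{\text{RAMs}} \cdot \underbrace{G}_{\text{depth}}
  = z I n_c n_v = z I n_v^{2} (1 - R), \\
\text{CNP inputs handled per stage} &= I \cdot M \cdot \frac{c - b}{G} \cdot n_v
  = \frac{z I n_c n_v}{G} = \frac{z I n_v^{2} (1 - R)}{G}.
\end{align*}
```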

It can be seen that the error performance of an LDPCCC 
can generally be improved at the cost of a higher processor 
complexity, more memory usage or a lower throughput. For 
instance, with the sub-matrix size z x z fixed, as the code 
rate R decreases, the error performance becomes better at the 
cost of a lower information throughput. Furthermore, both the 
processor complexity and the memory requirement become 
higher due to an increase in the number of check nodes. 
With the code rate and the throughput fixed, as the sub-matrix 
size increases, the error performance improves with the same 
processor complexity but more memory usage. The experimental
results presented in Section IV will provide a rough guideline
on how to choose the parameters in order to achieve a targeted
error performance, processor complexity and memory usage.

Fig. 1. Illustration of constructing a QC-LDPCCC from a QC-LDPC block code.

In most of the previous works, a generic processing unit
such as that shown in Fig. 2(a) is applied in the LDPCCC
decoder. For this type of design, a switch network and some
corresponding control logic are required. The complexity
overhead of the switch network is not a concern in the previous
works mainly because the number of edges between the check
nodes and the variable nodes is small. When the number of
edges between the check nodes and the variable nodes is
large, e.g., for a high-throughput and high code-rate LDPCCC,
the routing and hardware complexity of the switch network
becomes a critical issue.

In our proposed decoder, we use dedicated Block
Processing Units (BPUs) instead of generic processing units.
Consequently, routing and switching of the messages
are no longer required, i.e., the complex switch
network is eliminated. As shown in Fig. 2(b), we use $M$
BPUs in one processor. One BPU is used during each decoding
step of one codeword and $M$ BPUs are used to facilitate the
pipelining of $M$ distinct codewords simultaneously. In general,
our approach can obtain an $M$-fold speed-up in throughput
by pipelining $M$ distinct codewords. Details will be
described in Section III-C.

A. Architecture Design 

A high-throughput decoder requires parallel processing of
the LDPCCC. We propose a partially parallel decoder
architecture that utilizes parallelization on both the node level
and the iteration level. The number of rows and the number
of columns of the sub-matrices $H^b_{i,j}$ in (2) (corresponding
to $H_i(t)$ in (1)) are $c - b = zn_c/M$ and $c = zn_v/M$,
respectively. Our proposed decoder architecture is illustrated
in Fig. 3. The decoder consists of $I$ processors, where $I$
is the maximum number of decoding iterations. Since the
memory of a QC-LDPCCC constructed using the method
in Section II is $m_s = M - 1$, the variable nodes and the
check nodes in each processor are separated by a maximum
of $M - 1$ time instants. Denote the $c - b$ check nodes and the
$c$ variable nodes that enter a particular processor by $u_{t_0} =
[u_{t_0,1}, u_{t_0,2}, \ldots, u_{t_0,c-b}]$ and $v_{t_0} = [v_{t_0,1}, v_{t_0,2}, \ldots, v_{t_0,c}]$,
respectively. Then the check nodes and the variable nodes that
are about to leave the processor are given by $u_{t_0-M+1} =
[u_{t_0-M+1,1}, u_{t_0-M+1,2}, \ldots, u_{t_0-M+1,c-b}]$ and $v_{t_0-M+1} =
[v_{t_0-M+1,1}, v_{t_0-M+1,2}, \ldots, v_{t_0-M+1,c}]$, respectively. At each
decoding step, a BPU is responsible for processing the check
nodes that enter the processor (i.e., $u_{t_0}$) and the variable nodes
that are about to leave the processor (i.e., $v_{t_0-M+1}$).

(b) Dedicated Block Processing Units

Fig. 2. Generic Processing Unit and Dedicated Block Processing Unit.
At the start of each decoding step, $c - b$ check nodes are
to be processed. We divide them into $G$ groups and
consequently we divide a complete decoding step into $G$ stages.
At the $i$-th stage ($i = 1, 2, \ldots, G$), the $(c-b)/G$ check nodes
$[u_{t_0,(i-1)(c-b)/G+1}, u_{t_0,(i-1)(c-b)/G+2}, \ldots, u_{t_0,i(c-b)/G}]$ are
processed in parallel. The variable-to-check messages,
expressed in the sign-and-magnitude format, are input to a
group of $(c-b)/G$ check-node processors (CNPs). Among
the resulting check-to-variable messages, those between the
check nodes in $u_{t_0}$ and the variable nodes not in the set
$v_{t_0-M+1}$ will be written to the local RAMs, waiting to be
further processed by other BPUs. On the other hand, the
updated check-to-variable messages between the check nodes
in $u_{t_0}$ and the variable nodes in $v_{t_0-M+1}$ are converted
to the 2's complement format before being processed by
the variable-node processor (VNP). Since each check node
is connected to a total of $c/z$ variable nodes in $v_{t_0-M+1}$,
$((c-b)/G) \times (c/z) = c(c-b)/(Gz)$ variable nodes in $v_{t_0-M+1}$
are connected to the newly updated check nodes and hence
$c(c-b)/(Gz)$ VNPs are needed in one BPU. Finally, the updated
variable-to-check messages are converted back to the
sign-and-magnitude format and they will be shifted to the next
processor together with their associated channel messages in
the next decoding step.

Fig. 3. Block diagram of the pipeline processors in the LDPCCC decoder.

In the BPUs, the CNPs update the check nodes according
to (4). However, in practical implementations we need
to quantize the messages to reduce the complexity. In our
implementation, we adopt a four-bit quantization, where the
quantization step is derived based on density evolution [25]
and differential evolution [26]. Empirical results show that its
error performance is only 0.1 dB worse than that of the floating-point
sum-product algorithm (SPA).

We consider a check node with degree $d$. For a full
quantized-SPA (QSPA) implementation, there should be $d$
inputs, each 4 bits long. Consequently, the size of the
look-up table (LUT) becomes $2^{4d}$, which equals $2^{96}$ (as we
use $d_c = 24$) in our design. Such an enormous LUT is
impractical to implement. Here, we
propose to implement the CNP with quantization (QSPA) by
first pairing up the input messages and then calculating the
extrinsic messages excluding the input itself. More specifically,
suppose the variable nodes connected to check node $m$ are listed
as $[n_1, n_2, \ldots, n_d]$ and the corresponding input messages are
denoted by $[s_1, s_2, \ldots, s_d]$. The updated check-to-variable
message to variable node $n_i$ is then calculated as

$$
Q\{\alpha_{mn_i}\} = \mathcal{O}(s_{i-}, s_{i+}) \tag{7}
$$

where

$$
\mathcal{O}(i, j) = Q\left\{ 2\tanh^{-1}\left( \tanh\frac{i}{2} \tanh\frac{j}{2} \right) \right\} \tag{8}
$$

$$
s_{i-} = \mathcal{O}(\mathcal{O}(\mathcal{O}(s_1, s_2), s_3), \ldots, s_{i-1}) \tag{9}
$$

$$
s_{i+} = \mathcal{O}(\mathcal{O}(\mathcal{O}(s_d, s_{d-1}), s_{d-2}), \ldots, s_{i+1}). \tag{10}
$$

Thus, (7) can be implemented based on a simple LUT tree,
as shown in Fig. 4. In fact, it can be easily verified that
each LUT is of size $2^8 = 256$ and the total number of units
required is always $2d = 48$. Thus, our proposed tree-structured
implementation ensures that the CNP complexity remains low,
namely in $O(d_c)$. Moreover, the VNP is basically an adding
operation which can be implemented using an adder tree.
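The pairwise computation of the extrinsic messages can be sketched in Python as a forward pass and a backward pass over the inputs. This is a software illustration only, not the LUT tree itself. The names `pair_op`, `extrinsic_messages` and `uniform_quantizer` are hypothetical, and the uniform quantizer with a fixed step is an assumption, since the actual quantization levels are obtained by density and differential evolution and are not reproduced here.

```python
import math

def uniform_quantizer(step, bits=4):
    """Hypothetical sign-and-magnitude quantizer: levels -(2^(bits-1)-1)..(2^(bits-1)-1) times `step`."""
    max_level = 2 ** (bits - 1) - 1
    def q(x):
        level = max(-max_level, min(max_level, round(x / step)))
        return level * step
    return q

def pair_op(a, b, q=lambda x: x):
    """Two-input operator: combine two LLRs, then quantize (identity quantizer by default)."""
    p = math.tanh(a / 2.0) * math.tanh(b / 2.0)
    p = max(min(p, 1.0 - 1e-12), -1.0 + 1e-12)   # keep atanh finite
    return q(2.0 * math.atanh(p))

def extrinsic_messages(s, q=lambda x: x):
    """Return, for each input i, the combination of all inputs except s[i] (requires len(s) >= 2)."""
    d = len(s)
    fwd = [None] * d   # fwd[i]: combination of s[0..i-1]
    bwd = [None] * d   # bwd[i]: combination of s[i+1..d-1]
    acc = None
    for i in range(d):
        fwd[i] = acc
        acc = s[i] if acc is None else pair_op(acc, s[i], q)
    acc = None
    for i in reversed(range(d)):
        bwd[i] = acc
        acc = s[i] if acc is None else pair_op(acc, s[i], q)
    out = []
    for i in range(d):
        if fwd[i] is None:
            out.append(bwd[i])
        elif bwd[i] is None:
            out.append(fwd[i])
        else:
            out.append(pair_op(fwd[i], bwd[i], q))
    return out

print(extrinsic_messages([0.5, -1.2, 2.0, 0.8, -0.3]))
print(extrinsic_messages([0.5, -1.2, 2.0, 0.8, -0.3], q=uniform_quantizer(0.5)))
```

Each pass reuses the previous partial result, so the number of two-input operations grows linearly with the check-node degree, whereas a single table indexed by all $d$ inputs grows exponentially.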
B. Memory storage

For clarity of presentation, we first assume $M = n_c$. Hence
we have $c - b = z$ and $c = zn_v/n_c$. As mentioned earlier,
we divide the decoding step into $G$ stages with $z/G$ check
nodes being processed in parallel. We consider the $t_0$-th block
row of $H_{[0,\infty]}$ shown in Fig. 1. This block row consists of
$1 \times (n_v/n_c)$ sub-matrices, each having a size of $z \times z$. Thus,
this block row corresponds to $z$ check nodes and $zn_v/n_c$
variable nodes in the Tanner graph. We also assume that the
$1 \times (n_v/n_c)$ sub-matrices are either the identity matrix or
cyclic-right-shifted identity matrices. Suppose $u_{t_0}$ and $v_{t_0}$ just
enter a particular processor and $u_{t_0-M+1}$ and $v_{t_0-M+1}$ are
about to be shifted out of the same processor. The memory
requirement is explained as follows.

1) Storage of check-to-variable and variable-to-check
messages: We denote the check nodes by
$u_{t_0} = [u_{t_0,1}, u_{t_0,2}, \ldots, u_{t_0,z}]$. We further divide them
into $G$ groups, with the $i$-th group being denoted by
$[u_{t_0,1+(i-1)z/G}, u_{t_0,2+(i-1)z/G}, \ldots, u_{t_0,z/G+(i-1)z/G}]$
($i = 1, 2, \ldots, G$). As explained previously, in processing $u_{t_0}$,
the check nodes in the $i$-th group are
processed in parallel at the $i$-th stage of a decoding step.
Therefore, in order to avoid collisions of memory access,
$z/G$ different RAMs are needed for storing the $z/G$ messages
on the edges if each of the $z/G$ check nodes is connected
to only one variable node. From the construction of the
QC-LDPCCC, moreover, each check node has a regular degree of
$n_v$, i.e., each check node is connected to $n_v$ variable nodes.
Consequently, a total of $zn_v/G$ RAMs are needed for storing
the edge-messages passing between the check nodes in $u_{t_0}$
and their connected variable nodes to avoid collisions
of memory access. Further, each processor has $M$ sets of
such check nodes, i.e., $u_{t_0}, u_{t_0-1}, \ldots, u_{t_0-M+1}$. As a result,
$zn_vM/G$ RAMs are allocated in one processor to store the
edge-messages, i.e., check-to-variable or variable-to-check
messages. In addition, the data-depth and the data-width of
the RAMs are equal to $G$ and the number of quantization
bits, respectively.

2) Storage of channel messages: For the channel messages,
the memory storage mechanism is similar. The set of $z$
variable nodes corresponding to every $z \times z$ sub-matrix is
first divided into $G$ groups. Then $z/G$ RAMs, each
having $G$ entries, are allocated to store the channel messages.
Moreover, the variable nodes in $v_{t_0}$ correspond to $n_v/n_c$
sub-matrices and each processor contains $M$ variable-node sets
denoted by $v_{t_0}, v_{t_0-1}, \ldots, v_{t_0-M+1}$. Consequently, a total
of $zn_vM/(n_cG) = zn_v/G$ RAMs are allocated to store the
channel messages in one processor. The data-depth and the
data-width of the RAMs are equal to $G$ and the number of
quantization bits, respectively.

For the general case where $M$ is not necessarily equal to $n_c$,
$zn_cn_v/G$ RAMs are needed to store the edge-messages and
$zn_vM/(n_cG)$ RAMs are required to store the channel messages
in one processor. In modern FPGAs, the total number of
internal memory bits is usually sufficient for storing the
messages of codes with a reasonable length and with a reasonable
number of decoding iterations. However, the number of RAM
blocks is usually insufficient. Note that the operations of the
pipeline processors are identical, the connections between the
RAMs and the BPUs are the same and the addresses for
accessing the RAMs are the same. By taking advantage of the
homogeneity of the processors, we can combine the RAMs in
different processors into one large RAM block. In particular,
for the RAMs handling edge-messages, we can combine the $I$
sets of $zn_cn_v/G$ RAM blocks distributed in the $I$ processors
into one set of $zn_cn_v/G$ RAM blocks. Similarly, for the
RAMs storing the channel messages, $I$ sets of $zn_vM/(n_cG)$
RAM blocks are combined into one set of $zn_vM/(n_cG)$ RAM
blocks. The data-depth of the RAMs remains the same while
the data-width becomes $I$ times wider. Note that this memory
combination is a unique feature of LDPCCCs and is not available
to LDPC block codes².
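The effect of combining the RAMs across processors can be tabulated with a small Python helper. This is a sketch under stated assumptions: the function name `ram_budget` and the parameter values in the usage line are hypothetical, and the channel messages are taken to use the same number of quantization bits as the edge-messages, as stated for the data-width above.

```python
def ram_budget(z, n_c, n_v, M, G, I, q_bits):
    """RAM counts and shapes for the general case, before and after combining I processors."""
    if z % G or (z * n_v * M) % (n_c * G):
        raise ValueError("G must divide z, and n_c * G must divide z * n_v * M")
    edge_rams = z * n_c * n_v // G            # edge-message RAMs per processor
    chan_rams = z * n_v * M // (n_c * G)      # channel-message RAMs per processor
    return {
        "edge_rams": edge_rams,
        "channel_rams": chan_rams,
        "depth": G,                                        # unchanged by the combination
        "width_separate": q_bits,                          # one processor per RAM
        "width_combined": I * q_bits,                      # I processors share one RAM
        "ram_blocks_separate": I * (edge_rams + chan_rams),
        "ram_blocks_combined": edge_rams + chan_rams,
        "total_bits": (edge_rams + chan_rams) * G * I * q_bits,
    }

# Hypothetical parameters, chosen only to exercise the formulas.
budget = ram_budget(z=12, n_c=3, n_v=6, M=3, G=4, I=5, q_bits=4)
for name, value in budget.items():
    print(name, value)
```

The total number of stored bits is the same in both arrangements; only the number of RAM blocks drops by a factor of $I$, which is the resource that tends to run out first on the FPGA.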

Another advantage of such a memory storage mechanism
is that the address controller is a simple counter incrementing
by one at every cycle, thanks to the quasi-cyclic structure.
Specifically, at the start of each decoding step, the addresses
for accessing the RAMs are initialized based on the
parity-check matrix $H_{[0,\infty]}$. As the decoding process proceeds, the
addresses are incremented by one after every stage, until all
$G$ stages are completed.



² For block codes, sophisticated memory optimization has been proposed in
[27]. High complexity is involved and memory efficiency is achieved at the
cost of a lower throughput.
C. Pipeline scheduling

Conventional LDPCCC decoder architectures (JT3] 02] flH] 
adopt the pipeline design shown in Fig. 5. Each processor
sequentially does the following: shift the messages in, update the
check nodes, write the data to memories, input the messages
to the VNP and update the variable nodes. This pipeline schedule
only utilizes pipelining on the iteration level, following the
standard decoding process. In this paper, we propose a more
efficient pipeline schedule based on our dynamic memory
storage structure.

We first describe the pipeline schedule for a single code- 
word. Instead of writing the updated messages from CNP 
and those from VNP in two separate stages, we combine 
them with the shifting operation. The updated messages from 
VNP and the channel messages associated with the updating 
variable nodes are directly output to the next processor, which 
completes the writing and shifting operations at the same time. 
Since some of the updated messages from CNP need not be 
processed by VNP, they are written to the local memories at 
the same time. Note that the memory locations into which 
the messages are shifted are exactly those storing the original 
messages loaded by the BPU. Therefore, no memory collisions
occur during the process.

It can also be inferred from this process that the types of 
messages stored in the memories are dynamically changing. 
The messages associated with u to are all variable-to-check 
messages by the time u to first enters a processor and is ready 
to be processed by CNP. After each decoding step, some of 
the messages are substituted by the updated variable-to-check 
messages from the previous processor. When M decoding 
steps are completed, all the check-to-variable messages orig- 
inally associated with u to will be completely substituted by 
variable-to-check messages. Yet, they are now messages for 
U( +a/+i and are ready for CNP in a new round of decoding. 

Figure 6(a) describes the pipeline for a single codeword
assuming $G = 3$ and $M = 4$. Comparing Fig. 5 and Fig. 6(a),
it can be observed that decoding a group of check nodes
using the proposed pipeline schedule takes only 4/7 of the
time required by the conventional schedule. The homogeneity of the
pipeline processors also facilitates pipeline processing of
multiple codewords. As shown in Fig. 6(a), where a single
codeword is being decoded, the processing times of different
BPUs are separated in the sense that while one BPU is
processing, the other BPUs remain idle. To further increase
the throughput, we can schedule the other BPUs to process other
codewords. Since the total number of blocks in a processor is
$M$, we can incorporate a maximum of $M$ different codewords
in one processor, i.e., allowing BPU$_i$ to process Codeword-$i$,
for $i = 1, 2, \ldots, M$. Depending on the number of codewords
incorporated, the throughput can be increased by a factor of up to
$M$ at the cost of additional memory storage and additional
hardware complexity of the BPUs. Figure 6(b) illustrates the
pipeline schedule for four codewords with $G = 3$ and $M = 4$.

Using our proposed pipeline schedule, the throughput of
the decoder is $(n_v - n_c)z/M$ information bits for every $G + d$
cycles, where $d$ is the time delay for each pipeline stage
such that $G + d$ cycles are used by one BPU. As the number of
decoding stages increases, i.e., as $G$ increases, the throughput tends
to $(n_v - n_c)zf/(MG)$ bits/s with a clock running at $f$ Hz.
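The throughput expression and its limit can be evaluated directly. The Python sketch below uses hypothetical function names, and the numerical values in the usage lines (in particular the delay $d = 2$ and the 100 MHz clock applied to a toy code) are assumptions chosen only to exercise the formulas, not measurements.

```python
def throughput_bps(n_v, n_c, z, M, G, d, f_hz):
    """Information throughput: (n_v - n_c) z / M bits every G + d clock cycles."""
    bits_per_step = (n_v - n_c) * z / M
    return bits_per_step * f_hz / (G + d)

def throughput_limit_bps(n_v, n_c, z, M, G, f_hz):
    """Value approached when the pipeline delay d is negligible compared with G."""
    return (n_v - n_c) * z * f_hz / (M * G)

# Toy parameters: n_v = 4, n_c = 2, z = 4, M = 2, G = 2, with an assumed d = 2.
actual = throughput_bps(4, 2, 4, 2, G=2, d=2, f_hz=100e6)
limit = throughput_limit_bps(4, 2, 4, 2, G=2, f_hz=100e6)
print(actual, limit, actual / limit)   # the ratio equals G / (G + d)
```

The ratio of the two values is $G/(G+d)$, so the fixed pipeline delay matters less as the number of stages per decoding step grows.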

An illustrative example of the RAM storage and decoding
process

Example: We consider a QC-LDPCCC with $G = 2$,
$z = 4$, $n_c = 2$ and $n_v = 4$. Since $M = \gcd(n_c, n_v) =
2$, each processor has $M = 2$ BPUs. In each processor,
$zn_cn_v/(MG) = 8$ RAMs are dedicated to storing
edge-messages and $zn_v/(n_cG) = 4$ RAMs are dedicated to storing
channel messages. Assume that the check nodes $u_{t_0} =
[u_{t_0,1}, u_{t_0,2}, \ldots, u_{t_0,4}]$ just enter a processor and the variable
nodes $v_{t_0-1} = [v_{t_0-1,1}, v_{t_0-1,2}, \ldots, v_{t_0-1,8}]$ are about to
leave. The decoding step of processing BPU$_i$ ($i = 1, 2$) is
divided into $G = 2$ stages. Figure 7 shows the dynamic storage
of the edge-messages in the RAMs at different time instances.

Step 1) It shows the RAM storage at the start of processing 
u to and v *o-i by BPUi. It can be seen that RAM 1 to 8 
store the variable-to-check messages for u to which is ready 
to be processed. RAM 13 to 16 store the latest check-to- 
variable messages for u to _i, which are updated in the previous 
decoding step by BPU2. RAM 9 to 12 store the variable- 
to-check messages that are newly updated in the previous 
decoding step and are shifted from the previous processor. 

Step 2) It shows the RAM storage after the first stage of 
BPUi processing. At the first stage, BPUi will process Ut ,i 
and u tfl .2 and their connected variable nodes in Vj _i, e.g., 
[vt -i,3,Vt -i,4,v to -i,5,v to -i,8]' CNP reads the variable-to- 
check messages from the first set of entries located in RAM 1 
to 8. The newly updated check-to- variable messages between 
u to and v tn from CNP are input to the first set of entries in 
RAM 1 to 4 (i.e., from where the check-to-variable messages 
are read), while the newly updated check-to-variable messages 
between u to and v to _i are input to the VNP and the resulting 
variable-to-check messages are shifted to the next processor. 
As a result, the updated variable-to-check messages between 
Vto+i aR d u *o+2 are written to RAM 5 to 8 and those between 
v to+ i and Ut 0+ i are written to RAM 13 to 16. 

Step 3) This shows the RAM storage after the second stage of
BPU$_1$ processing. At the second stage, BPU$_1$ processes
$u_{t_0,3}$ and $u_{t_0,4}$ and their connected variable nodes in
$v_{t_0-1}$, e.g.,
$[v_{t_0-1,1}, v_{t_0-1,2}, v_{t_0-1,6}, v_{t_0-1,7}]$. The CNP
reads the variable-to-check messages from the second set of
entries located in RAMs 1 to 8. The newly updated
check-to-variable messages between $u_{t_0}$ and $v_{t_0}$ from
the CNP are written to the second set of entries in RAMs 1 to 4
(i.e., the entries from which the messages were read), while the
newly updated check-to-variable messages between $u_{t_0}$ and
$v_{t_0-1}$ are input to the VNP, and the resulting
variable-to-check messages are shifted to the next processor.
As a result, the updated variable-to-check messages between
$v_{t_0+1}$ and $u_{t_0+2}$ are written to RAMs 5 to 8 and those
between $v_{t_0+1}$ and $u_{t_0+1}$ are written to RAMs 13 to 16.
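
The stage-by-stage RAM accesses in Steps 2) and 3) can be
summarised in a short Python sketch. It is an illustrative model
only, not the hardware description of the decoder: the function
name `bpu1_schedule` and the dictionary layout are hypothetical,
the RAM numbers are copied from the z = 4, G = 2 example, and the
even split of the four check nodes over the G stages is taken
from that example.

```python
def bpu1_schedule(num_check_nodes=4, num_stages=2):
    """Tabulate the RAM accesses of BPU 1 for the z = 4, G = 2 example.

    Hypothetical model: the RAM indices (1..16) are copied from the
    worked example; nothing here is derived from the FPGA design.
    """
    per_stage = num_check_nodes // num_stages  # check nodes handled per stage
    schedule = []
    for stage in range(1, num_stages + 1):
        first = (stage - 1) * per_stage + 1
        schedule.append({
            "stage": stage,
            # check nodes of u_{t0} processed at this stage
            "check_nodes": list(range(first, first + per_stage)),
            # CNP reads variable-to-check messages from this entry set
            "cnp_read": {"rams": list(range(1, 9)), "entry_set": stage},
            # updated check-to-variable messages go back to the same entry set
            "cnp_write": {"rams": list(range(1, 5)), "entry_set": stage},
            # VNP outputs (variable-to-check) land in RAMs 5-8 and 13-16
            "vnp_write_rams": list(range(5, 9)) + list(range(13, 17)),
        })
    return schedule


for row in bpu1_schedule():
    print(row["stage"], row["check_nodes"], row["cnp_write"])
```

Each row corresponds to one stage: the entry set that is read by
the CNP is the same one that receives the updated
check-to-variable messages, which is what allows the storage to
be reused in place.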

The RAM updating at the decoding step of BPU$_2$ is
analogous to Steps 2) and 3) above. After the second stage of
BPU$_2$, RAMs 1 to 8 will hold the variable-to-check messages
ready for $u_{t_0+2}$ and their connected variable nodes in
$v_{t_0+1}$. The RAM storage is then similar to that in Step 1)
with the time instances incremented by $M = 2$. A new round of
BPU$_1$ updating will follow according to Steps 2) and 3).

Fig. 5. Conventional pipelining (stages in order: S, CNR, CNP,
CNW, VNR, VNP, VNW). S: Shift messages between processors; CNR:
Input messages to CN; CNP: CN processing; CNW: Output messages
from CN; VNR: Input messages to VN; VNP: VN processing; VNW:
Output messages from VN.

Fig. 6. Proposed pipeline. (a) Single-codeword pipeline.
(b) Multiple-codeword pipeline. $B_i$: processing of block $i$;
S-W: Shift messages and write messages to the next processor; R:
Input messages to the block processing unit; CNP: check-node
processing; VNP: variable-node processing.



Also note that once the address controller is initialized at
the start of the $G$ stages, the read/write addresses for
accessing the RAMs are simply incremented by 1.
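
A minimal Python sketch of such an address controller is given
below. The class name `AddressController`, the `depth` parameter
and the wrap-around at the end of the memory are assumptions made
for illustration; the text only states that the address is
initialized once and then incremented by 1.

```python
class AddressController:
    """Toy model of a RAM address counter (hypothetical, for illustration)."""

    def __init__(self, depth):
        self.depth = depth      # number of entries per RAM (G in the example)
        self.address = 0

    def initialize(self, start):
        """Load the starting address once, before the first stage."""
        self.address = start % self.depth

    def next_address(self):
        """Return the current address, then increment it by 1.

        Wrapping modulo `depth` is an assumption of this sketch.
        """
        current = self.address
        self.address = (self.address + 1) % self.depth
        return current


ctrl = AddressController(depth=2)   # G = 2 stages, as in the Fig. 7 example
ctrl.initialize(start=0)
print([ctrl.next_address() for _ in range(2)])  # addresses for stages 1 and 2
```

No address computation is needed beyond the single increment,
which is why the controller can stay simple in hardware.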

IV. Experimental Results 

We have implemented the QC-LDPCCC decoder on an Altera
Stratix IV FPGA. All the BER results for the QC-LDPCCC decoder
are therefore obtained from FPGA experiments under additive
white Gaussian noise (AWGN) channels and 4-bit quantization.
Based on a QC-LDPC block code with a $4 \times 24$ base matrix,
we construct QC-LDPCCCs of different sub-matrix sizes. Moreover,
the sub-matrices of the block code are chosen such that the
girth equals 8. We then evaluate the BER performance of the
QC-LDPCCCs under different numbers of decoding iterations.
Specifically, we have implemented LDPCCC decoders with the
following parameters: (a) $z = 422$ and $I = 18$; (b) $z = 512$
and $I = 18$; (c) $z = 1024$ and $I = 12$; (d) $z = 1024$ and
$I = 10$. Recall that $z \times z$ represents the sub-matrix size
of each entry in the $4 \times 24$ base matrix while $I$ denotes
the number of iterations (i.e., processors) used in the LDPCCC
decoders.
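
For orientation, the block length and design rate implied by
these parameters can be computed as sketched below. The sketch
relies on the usual QC-LDPC conventions, which are assumptions
here: the code length of the underlying block code is taken as
the number of base-matrix columns times z, and the design rate
assumes a full-rank parity-check matrix. The function name
`qc_params` is hypothetical.

```python
from fractions import Fraction

BASE_ROWS, BASE_COLS = 4, 24   # 4 x 24 base matrix from the text


def qc_params(z, rows=BASE_ROWS, cols=BASE_COLS):
    """Return (block length, design rate) for z x z sub-matrices."""
    length = cols * z                      # coded bits per block
    rate = Fraction(cols - rows, cols)     # design rate, assuming full rank
    return length, rate


# (z, I) pairs of the four implemented decoders
for z, iterations in [(422, 18), (512, 18), (1024, 12), (1024, 10)]:
    length, rate = qc_params(z)
    print(f"z={z:4d} I={iterations:2d} length={length:5d} rate={rate}")
```

The design rate obtained this way, 5/6, agrees with the rate
quoted for the implemented codes.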

Table I shows the hardware complexity of the decoders
when combined with the noise generator. The complexities
for a single-codeword implementation as well as a
four-codeword pipeline implementation are shown. We observe that
the hardware complexity increases as the code length and the
number of processors increase. Figure 8 further shows the
BER results for the LDPCCCs.

Fig. 7. Example of RAM storage, $z = 4$ and $G = 2$. Panels: at
the start of processing $u_{t_0}$ and $v_{t_0-1}$ by BPU$_1$;
after the 1st stage of BPU$_1$ updating; after the 2nd stage of
BPU$_1$ updating; after the 1st stage of BPU$_2$ updating; after
the 2nd stage of BPU$_2$ updating.

TABLE I
Implementation complexity for QC-LDPCCC of different sub-matrix
sizes. Code 1-S: $z = 422$, $I = 18$, single-codeword. Code 2-S:
$z = 512$, $I = 18$, single-codeword. Code 3-S: $z = 1024$,
$I = 12$, single-codeword. Code 4-S: $z = 1024$, $I = 10$,
single-codeword. Code 1-P: $z = 422$, $I = 18$, four-codeword
pipeline. Code 2-P: $z = 512$, $I = 18$, four-codeword pipeline.
Code 3-P: $z = 1024$, $I = 12$, four-codeword pipeline. Code 4-P:
$z = 1024$, $I = 10$, four-codeword pipeline. The implementation
complexity of the QC-LDPC block decoder in [9] is shown for
comparison.

Fig. 8. Bit-error-rate (BER) results for the LDPCCCs with
different sizes. The results are obtained from FPGA experiments
under AWGN channels and 4-bit quantization.

Based on Fig. 8 and Table I, we can see a tradeoff between
(i) the BER performance, (ii) the code length and (iii) the
number of processors (i.e., the number of iterations). We first
compare the performance of LDPCCCs with $z = 1024$ but
with different numbers of decoding iterations $I$. We can see
that the LDPCCC with $I = 12$ is more than 0.1 dB better
than that with $I = 10$ at a BER of $3 \times 10^{-10}$. We
further compare the error performance of codes with similar
processor complexity. We observe from Table I that the LDPCCC
using $z = 1024$ and $I = 12$ has a complexity similar to the
ones using (i) $z = 422$ and $I = 18$ or (ii) $z = 512$ and
$I = 18$. Figure 8 shows that the LDPCCC using $z = 1024$ and
$I = 12$ is outperformed by the ones using (i) $z = 422$ and
$I = 18$ or (ii) $z = 512$ and $I = 18$, even though the latter
two codes have smaller sub-matrix sizes. A larger number of
decoding iterations can therefore help reduce the error rate
even when a smaller sub-matrix size is used. In summary, we
find that the number of decoding iterations plays an important
role in the error performance of the LDPCCC.

Based on the above results, the following guidelines can be
used in designing an LDPCCC decoder.

• To increase the decoder throughput while maintaining
a similar BER performance and the same number of
memory bits, we can reduce the memory depth $G$ at the
cost of more combinational logic.

• To reduce the cost of combinational logic while
maintaining a similar BER performance and throughput, we
can increase $z$ and use a smaller number of processors
$I$. Under such circumstances, the total number of memory
bits may increase.

• To reduce the memory bits while maintaining a similar
BER performance and throughput, we can use a smaller
$z$ and a larger $I$ at the cost of more combinational logic.

In addition, we attempt to compare our implementation
results with those found in the literature. Since the objective
of our work is to achieve high throughput and good error
performance, the code length and code rate of the codes used
in our experiments are relatively large. While we can find
quite a number of decoders in the literature, none of them
considers codes with lengths comparable to the ones we use.
All of them assume relatively short lengths and consequently
have high error floors and small coding gains. The "closest"
one we can find is the QC-LDPC block decoder described by Wang
and Cui [9], who target a high-speed decoder and adopt a
length-8176 QC-LDPC code in their experiment. In Table I, we add
the implementation results of the decoder in [9]. Although the
decoder in [9] seems to be less complex than our designs, its
throughput (0.2 Gb/s) is only 1/10 of ours (2 Gb/s). If 10
decoders in [9] are put together in order to achieve the same
throughput as our decoders, the total complexity of the decoders
will become larger than ours. Furthermore, the decoder in [9]
displays an error floor at a BER of $10^{-10}$ while our decoder
does not. In fact, at a BER of $10^{-10}$, our decoders can
achieve an extra coding gain of 0.8 dB to 1 dB over the decoder
in [9]. Thus, our proposed decoder is superior in achieving high
throughput, high coding gain and low error floor.
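
The throughput comparison above is simple arithmetic, reproduced
here as a sketch. The two throughput figures are those quoted in
the text, and the 100 MHz clock is the one reported in the
abstract for the proposed decoder; the constant and function
names are illustrative, and the complexity figures of Table I
are deliberately not reproduced.

```python
OURS_BPS = 2_000_000_000       # 2 Gb/s, proposed decoder
REF_BPS = 200_000_000          # 0.2 Gb/s, block decoder of [9]
CLOCK_HZ = 100_000_000         # 100 MHz clock of the proposed decoder


def copies_needed(target_bps, per_decoder_bps):
    """Number of parallel decoders required to reach the target throughput."""
    return -(-target_bps // per_decoder_bps)   # ceiling division on integers


print(copies_needed(OURS_BPS, REF_BPS))   # parallel copies of the decoder in [9]
print(OURS_BPS // CLOCK_HZ)               # bits delivered per clock cycle
```

The first value is the replication factor used in the comparison;
the second shows how many bits the proposed decoder must deliver
per clock cycle to sustain its quoted throughput.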

Fig. 9. Comparison of BER results between LDPCCCs and LDPC
block-code counterparts under AWGN channels. The results of the
LDPCCCs and the LDPC block codes are represented by solid lines
and dashed lines, respectively. The results of the LDPCCCs are
obtained from FPGA experiments under 4-bit quantization while
those of the LDPC block codes are obtained from computer
simulations (using C programming) based on 4-bit quantized
messages.

We also compare the BER performance of LDPCCCs and
their block-code counterparts under similar processor
complexity and throughput. Compared with a single-processor
decoder of an LDPC block code with the same iteration number
$I$, an LDPCCC decoder with $I$ processors, each storing as many
coded bits as the code length of the block code, incurs $I$
times the complexity but achieves $I$ times the throughput. In
order for the LDPC block decoder to attain the same throughput,
$I$ times as many processors are needed to decode in parallel.
Under such circumstances, the overall complexity of the LDPC
block decoder increases by a factor of $I$ and becomes the same
as that of the LDPCCC counterpart. Therefore, comparing an
LDPCCC with the block code from which it is derived is fair
from the perspective of processor complexity and throughput.
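
The scaling argument can be written out as a small sketch.
Complexity and throughput are expressed in normalized units of
one block-code processor, which is an assumption made here for
illustration; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decoder:
    complexity: int   # in units of one block-code processor
    throughput: int   # in units of one block-code processor's throughput

    def replicate(self, copies):
        """Run `copies` identical decoders in parallel."""
        return Decoder(self.complexity * copies, self.throughput * copies)


def ldpccc(num_processors):
    """LDPCCC decoder with I pipelined processors, scaled as in the text."""
    return Decoder(complexity=num_processors, throughput=num_processors)


block = Decoder(complexity=1, throughput=1)   # single-processor block decoder
for I in (10, 12, 18):
    matched = block.replicate(I)              # block decoders at equal throughput
    print(I, matched == ldpccc(I))
```

Under these normalized units, the replicated block decoder and
the LDPCCC decoder coincide in both complexity and throughput for
every value of I used in the experiments.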

Figure 9 shows the BER performance of LDPCCCs and
their block-code counterparts. The results of the LDPC block
codes are obtained from computer simulations (using C
programming) based on 4-bit quantized messages. It can be seen
that the BER performance of the LDPCCCs is generally superior.
For instance, the LDPCCC with $z = 422$ and $I = 18$ has
a gain of 0.2 dB at a BER of $2 \times 10^{-5}$ over its
block-code counterpart. Another observation is that the
advantage of an LDPCCC over its block-code counterpart becomes
more evident as the number of decoding iterations increases. For
example, the LDPCCC with $z = 1024$ and $I = 10$ performs
similarly to its block-code counterpart at a BER of
$2 \times 10^{-5}$, whereas it outperforms its block-code
counterpart by 0.1 dB at a BER on the order of $10^{-6}$ when the
number of decoding iterations increases to 12, i.e., $I = 12$.
As a result, when the number of decoding iterations is large,
the LDPCCC is considered to be a better choice in terms of error
performance.

V. Conclusion

An efficient partially parallel decoder architecture for
QC-LDPCCC has been proposed in this paper. A dedicated
block processing unit is also proposed so that the
complexity overhead of the switch network can be removed.
Rate-5/6 LDPCCC decoders of different sub-matrix sizes have
been implemented on an Altera FPGA with our proposed
architecture. It is found that our decoders can achieve a
throughput of 2.0 Gb/s. Experimental results further show that
QC-LDPCCCs outperform their block-code counterparts under
the same throughput and similar overall decoder complexity.
Moreover, the QC-LDPCCCs derived from well-designed
block codes can achieve an error floor lower than $10^{-13}$.